Hanoi University of Science and Technology
School of Information and Communication Technology
Master Thesis in Data Science
Semi-Supervised end-to-end polyp detection
NGUYEN HONG SON
son.nh212249m@sis.hust.edu.vn
Supervisor: Dr. Dinh Viet Sang
Hanoi 10-2023
Author’s Declaration
I hereby declare that I am the sole author of this thesis. The results presented in this work are not copied from any other work.
STUDENT
Nguyen Hong Son
Contents
Contents
List of Figures
List of Tables
1 Introduction 1
1.1 Overview.................................. 1
1.2 Objectives ................................. 3
1.3 Main contributions ............................ 3
1.4 Outline of the thesis ........................... 3
2 Theoretical basis 5
2.1 Learning Type ............................. 5
2.1.1 Supervised Learning ....................... 5
2.1.2 Unsupervised Learning ...................... 6
2.1.3 Semi-supervised Learning .................... 7
2.2 Related work ............................... 7
2.2.1 Object Detection Problem .................... 7
2.2.2 Semi-supervised Learning .................... 10
2.2.3 Semi-supervised Object Detection (SSOD) ........... 11
3 Method 12
3.1 Preliminary ................................ 12
3.1.1 Residual Block .......................... 12
3.1.2 Backbone used in CenterNet baseline .............. 13
3.1.3 Feature Pyramid Network (FPN) ................ 16
3.1.4 CenterNet modelization ..................... 18
3.2 Improving Baseline model ........................ 19
3.2.1 Backbone ............................. 20
3.2.2 Head ................................ 22
3.3 Dense Target Producer (DTP) ...................... 24
3.3.1 Pseudo-Labeling Framework ................... 24
3.3.2 Disadvantages of Pseudo-box Labels .............. 25
3.3.3 Proposed method ......................... 28
4 Experiments 33
4.1 Dataset .................................. 33
4.1.1 Dataset .............................. 33
4.1.2 Metrics .............................. 35
4.2 Implementation details ......................... 37
4.3 Results ................................... 38
4.3.1 CenterNet++ improvement results ............... 39
4.3.2 DTP results ............................ 40
4.4 Ablation Studies ............................. 41
5 Conclusion 44
Bibliography 45
List of Figures
1.1 Proportion of CRC compared to other cancer diseases ........ 2
2.1 Illustration of supervised, unsupervised and semi-supervised learning .................................... 5
2.2 Two outputs of an object detection model: boxes and categories .... 8
2.3 Anchor boxes in the object detection task ............... 9
3.1 Architecture of Residual Block ...................... 13
3.2 Architecture of ResNet18 ........................ 14
3.3 Architecture of DLA .......................... 15
3.4 An example of Hourglass network architecture used for the segmentation task ................................... 16
3.5 FPN architecture ............................ 17
3.6 Overview architecture of CenterNet++ ................. 20
3.7 Four module types used in the backbone of CenterNet++ ....... 21
3.8 Architecture of ASF module ....................... 23
3.9 The overview of our proposed pipeline for unlabeled data compared with the existing pseudo-box based pipeline. In each iteration, dense targets are generated by the teacher model on unlabeled images via the Dense Target Producer (DTP). These targets are then used by the student model to calculate the unsupervised loss. The total loss is the sum of the supervised and unsupervised losses. Note that DTP does not need any postprocessing steps ..................... 24
3.10 How object detection methods obtain positive samples: (b) anchor-based and anchor-free methods with NMS assign all pixels inside the ground-truth box; (c) heatmap-based methods focus only on the center pixel of the ground-truth box ................... 26
3.11 Comparison between (b) foreground pixels assigned by ground-truth boxes and (c) foreground pixels assigned by pseudo-boxes ...... 27
3.12 Example of heatmap values at the beginning and end of the training process. Left: ground truth (GT); right: prediction (Pred) ...... 28
4.1 Some examples of images in the PolypsSet dataset ........... 34
4.2 Examples of IOU between two boxes .................. 35
4.3 Example of a Precision-Recall curve. The mean Average Precision (mAP) is the area under the curve. ................... 36
4.4 Examples of Mosaic augmentations ................... 38
4.5 Visualizations for PolypsSet 10% two-class. From left to right: without DTP (supervised), with DTP, and label ............... 42
List of Tables
4.1 Comparison between CenterNet++ and other object detectors for
fully supervised learning on the single-class PolypsSet dataset. .... 39
4.2 Comparison between CenterNet++ and other object detectors for
fully supervised learning on the two-class PolypsSet dataset. ..... 40
4.3 Experimental results for semi-supervised learning setting on the PolypsSet dataset. ................................ 40
4.4 Comparisons on the single-class PolypsSet dataset with semi-supervised
learning setting using 10% of training data as labeled ......... 41
4.5 Experiments on different backbones of CenterNet++ on the two-class PolypsSet dataset. ............................ 41
4.6 Effectiveness of ASF module and Mosaic augmentations on the single-class PolypsSet dataset. ......................... 43
4.7 Effectiveness of using TEA compared to fixed thresholds. Experiments were conducted on the single-class PolypsSet dataset with 1% data used as labeled. ........................... 43
List of abbreviations
AI Artificial Intelligence
ANN Artificial Neural Network
CNN Convolutional Neural Network
CRC Colorectal Cancer
DL Deep Learning
DTP Dense Target Producer
FC Fully Connected
FPN Feature Pyramid Network
FPS Frames per second
FCOS Fully Convolutional One-Stage Object Detection
IOU Intersection Over Union
NMS Non-Maximum Suppression
SSD Single Shot MultiBox Detector
SSOD Semi-supervised Object Detection
TEA Threshold Epoch Adaptive
YOLO You Only Look Once
Chapter 1
Introduction
1.1 Overview
Colorectal Cancer (CRC) is one of the most common types of cancer today. When
considering cancer diseases specifically, it ranks third in terms of incidence and even
second in terms of mortality. It is estimated that in 2023, there will be approximately
150,000 new cases and 52,000 deaths in the United States [1]. In Vietnam, CRC
consistently ranks among the top 10 most common cancer diseases, with the fourth-
highest diagnosis rate for males and the third-highest for females. The incidence
rate in the population is about 10.1/100,000. Figure 1.1 provides a clearer view of
the number and proportion of CRC cases compared to other cancer diseases. The
danger of CRC lies in the fact that when symptoms begin to appear and the patient
decides to seek medical attention, the disease has already progressed to later stages.
CRC is divided into five stages, with the stage increasing as the disease worsens. If the disease is detected in the early stages, the 5-year survival rate can be as high as 90%. However, in reality, only about 4 out of 10 cases are detected in the early stages. Therefore, to
reduce the incidence and mortality rates, efforts are often made to detect and remove
early signs of precancerous abnormalities. Among these abnormalities, polyps inside
the digestive tract, mainly in the colon (referred to as polyps), are considered the
most significant cause.
Polyps are abnormal growths of tissue that protrude from the lining of the digestive
tract. Polyps are typically divided into two types: neoplastic and non-neoplastic.
Non-neoplastic polyps are benign and can be further divided into hyperplastic, in-
flammatory, and hamartomatous types. Neoplastic polyps have the potential to become malignant and develop into cancer. They can be further divided into adenomatous
and serrated types. To detect polyps, doctors use a method called gastrointesti-
nal endoscopy. This involves using an endoscope - a tube with a light and camera
Figure 1.1: Proportion of CRC compared to other cancer diseases. Source: https://gco.iarc.fr/today/data/factsheets/populations/704-viet-nam-fact-sheets.pdf
attached that projects images onto a color TV screen - to examine the inside of
the digestive tract. The effectiveness of endoscopy depends on the skill of the doc-
tor performing the procedure. According to statistics [2], about 25% of polyps are
missed during endoscopy, which poses a significant risk to the patient. In order to
reduce this high-risk rate, attention is paid to two things:
Improving the quality of machines and tools used for endoscopy.
Using Artificial Intelligence (AI) or specifically Deep Learning (DL) models as
a diagnostic aid.
This second direction has shown a relatively significant effect, reducing the miss rate
by up to 50% [3].
The task that supports colonoscopy is called polyp detection. Its goal is to locate all polyps in a frame or a video. DL models for the polyp detection task need a large amount of labeled data during training in order to perform well. This faces a number of challenges, such as:
Lack of availability of public datasets.
Labeling this data requires someone knowledgeable about polyps, usually a
doctor.
It is sometimes difficult for labelers to agree on the size and type of polyps in
a data sample.
In addition, the characteristics of polyps such as small size, diverse shapes, and
colors also pose challenges in the labeling process. Therefore, the amount of labeled
data used for polyp detection is quite limited. Conversely, the number of endoscopy
videos is very large and continuously increasing, which is considered a huge amount
of unlabeled data. This is where semi-supervised object detection (SSOD) methods
come into play. The models applied in SSOD are usually well-known object detection models [4, 5, 6]. These models all have at least one of two components: anchor boxes and non-maximum suppression (NMS). Although these components help
the model achieve good performance, they significantly reduce inference speed. A
new type of model that does not use Anchor or NMS, called heatmap-based end-
to-end [7, 8], has been proposed to address these issues. In this study, we applied
semi-supervised learning to a heatmap-based end-to-end model to take advantage
of both.
1.2 Objectives
The objective of this study is to effectively apply SSOD to a heatmap-based end-to-
end model in order to achieve a highly accurate and fast model. On specific datasets, the model demonstrates superior results compared to existing methods. We hope that our model can be applied to endoscopy systems in hospitals.
1.3 Main contributions
The main contributions of this study:
We propose a novel heatmap-based end-to-end model called CenterNet++ to
improve the accuracy and inference speed over the original CenterNet model.
We propose a novel SSOD method called Dense Target Producer (DTP), which
can perform end-to-end without the need for any non-differentiable postpro-
cessing steps, and can be applied to heatmap-based end-to-end object detec-
tors. To our knowledge, this is the first attempt to apply SSOD to a heatmap-
based end-to-end model.
We propose a dynamic thresholding procedure called Threshold Epoch Adap-
tor (TEA) to adaptively filter unreliable pseudo labels based on the learning
status of the models.
We conduct experiments on a large benchmark dataset, namely PolypsSet. The
results show that our DTP improves the AP performance compared to the
supervised baseline model and outperforms other SSOD methods.
1.4 Outline of the thesis
The rest of this thesis is organized as follows:
Chapter 2 presents an overview and related work of this study’s field.
Chapter 3 describes in detail the method which applies SSOD to a heatmap-based end-to-end model.
Chapter 4 presents the content of the experiments, the results obtained, and some
ablation studies results.
Chapter 5 concludes the thesis.
Chapter 2
Theoretical basis
2.1 Learning Type
Based on the type of data input into the model, machine learning can be categorized into several learning types, namely supervised, unsupervised, and semi-
supervised learning. Additionally, there exist other learning types such as self-
supervised, weakly-supervised, and reinforcement learning, which, however, are not
mentioned within the scope of this research.
Figure 2.1: Illustration of supervised, unsupervised and semi-supervised learning. Source: https://blog.roboflow.com/what-is-semi-supervised-learning/
2.1.1 Supervised Learning
Supervised learning utilizes training data consisting of input-output pairs, where
the models learn from this labeled data to make predictions for future data. During
the learning process, the model predicts the output for the training data and then
calculates the adjustment amount based on the difference between the predicted out-
put and the actual label. This adjustment helps improve the model’s performance.
There are two main tasks in supervised learning:
Classification: The objective of the problem is to find a model that correctly
assigns data samples to their respective classes. The label in this problem is an
identifier (usually a numerical value) for the class to which the data belongs,
and the model aims to accurately predict this label for unseen data samples.
Classification has different use cases, such as spam filtering, customer behavior
prediction, and document classification.
Regression: The objective of the problem is to develop a model that synthe-
sizes the features of the input parameters to generate continuous real-valued
output results. It aids in the forecasting of continuous variables, such as Mar-
ket Trends and Home Prices.
2.1.2 Unsupervised Learning
Unsupervised learning uses machine learning algorithms to analyze and cluster unla-
beled datasets. These algorithms discover hidden patterns or data groupings without
the need for human intervention. Unsupervised learning models are utilized for three
main tasks—clustering, association, and dimensionality reduction.
Clustering is a technique that groups unlabeled data based on their similari-
ties or differences. Clustering algorithms are used to process raw, unclassified
data objects into groups represented by structures or patterns in the informa-
tion. Clustering algorithms can be categorized into a few types, specifically
exclusive, overlapping, hierarchical, and probabilistic.
Association Rules is a rule-based method for finding relationships between
variables in a given dataset. These methods are frequently used for market bas-
ket analysis, allowing companies to better understand relationships between
different products. Examples of this can be seen in Amazon’s “Customers
Who Bought This Item Also Bought” or Spotify’s ”Discover Weekly” playlist.
Dimensionality reduction. While more data generally yields more accurate
results, it can also impact the performance of machine learning algorithms (e.g.
overfitting) and it can also make it difficult to visualize datasets. Dimensional-
ity reduction is a technique used when the number of features, or dimensions,
in a given dataset, is too high. It reduces the number of data inputs to a
manageable size while also preserving the integrity of the dataset as much as
possible. It is commonly used in the preprocessing data stage.
2.1.3 Semi-supervised Learning
Semi-supervised learning is a type of machine learning that falls in between su-
pervised and unsupervised learning. It is a method that uses a small amount of
labeled data and a large amount of unlabeled data to train a model. The goal of
semi-supervised learning is to learn a function that can accurately predict the out-
put variable based on the input variables, similar to supervised learning. However,
unlike supervised learning, the algorithm is trained on a dataset that contains both
labeled and unlabeled data. Figure 2.1 gives a general description of the type of input for all three learning methods.
Semi-supervised learning is particularly useful when there is a large amount of un-
labeled data available, but it’s too expensive or difficult to label all of it.
2.2 Related work
2.2.1 Object Detection Problem
The objective of the Object detection problem is to determine the position and
category of objects within an image. The position of an object is represented by a
rectangular box, known as a bounding box, which includes the coordinates of the
top-left and bottom-right corners (or top-left corner and center). Regarding object
categorization, each object is assigned a unique identifier, and the model aims to
accurately predict this identifier for each bounding box representing an object. It
is evident that the model needs to simultaneously address both classification and
regression tasks, which are the primary focus of Supervised Learning. Figure 2.2
shows the bounding box and class type of objects in the image.
Object detection is a fundamental problem in the field of computer vision and has
been studied for a long time. Object detection poses significant challenges due to
the diversity in the number of object categories, variations in shape, color, size,
brightness, and even the angle of capture.
During the early period, image processing algorithms were applied to address the
object detection problem. Viola and Jones [9] utilized a sliding window approach
on images to extract regions of interest. These regions were then subjected to feature extraction and classification by the AdaBoost algorithm to determine if they
contained objects. To eliminate duplicate detections, a non-maximum suppression
Figure 2.2: Two outputs of an object detection model: boxes and categories. Source: https://www.v7labs.com/blog/object-detection-guide
step was employed. This step removed overlapping bounding boxes by selecting
the most confident detection and suppressing others that had significant overlap.
Non-maximum suppression (NMS) is a post-processing algorithm used to eliminate
redundant or overlapping detections. Its primary purpose is to select the most confi-
dent and accurate bounding boxes while removing redundant detections. Although
this method has the advantage of being very fast and can therefore be applied to real-time applications, it has many disadvantages. Its biggest disadvantage is a lack of generalization: if the size or viewing angle of the object or the background changes, the algorithm performs poorly. The same holds for objects with relatively complex properties (color, shape, appearance patterns, etc.). This algorithm is also prone to producing many false positives.
After the advent of the Deep Learning era, most research efforts focused on this
domain. It began with two-stage models, with the prominent example being the
R-CNN family [4]. The first stage involved extracting regions likely to contain
objects, which were marked by bounding boxes. The subsequent stage took these
regions as input and performed classification to determine if each region contained
an object. These two stages were trained separately and independently. Two-stage
models exhibited relatively high accuracy, but they suffered from slow inference
speeds. This drawback prompted research into one-stage models, which use a single deep neural network and offer faster inference speeds. However, in the early period, one-stage models had relatively poor accuracy, making two-stage models the preferred
choice.
The breakthrough came when anchor boxes were first introduced in YOLOv2 and SSD [10, 11], allowing one-stage models to achieve considerable accuracy improvements. Anchor boxes are predefined bounding boxes of different sizes
and aspect ratios. They assist in the localization aspect by providing prior knowl-
edge about potential object locations. They also help address variability in object appearance by representing a range of possible object sizes and shapes. Each anchor box is associ-
ated with a specific scale and aspect ratio. During the training phase, anchor boxes
are used to assign ground truth objects to the most suitable anchor boxes based on
the IoU (Intersection over Union) metric. This assignment process helps determine
which anchor boxes are responsible for detecting specific objects. In the inference
phase, the model predicts bounding box offsets and object class probabilities for
each anchor box. The anchor boxes serve as reference points, and the model adjusts
these predictions based on the anchor box dimensions and positions. Figure 2.3
describes how the anchor box works.
Figure 2.3: Anchor boxes in the object detection task. Source: https://medium.com/@nikitamalviya/object-detection-anchor-box-vs-bounding-box-bf1261f98f12
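The IoU-based assignment described above can be sketched in a few lines. This is an illustrative sketch, not code from any particular detector; the anchor coordinates and the ground-truth box below are made-up values.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# During training, a ground-truth box is assigned to the anchor
# that overlaps it the most.
anchors = [(0, 0, 4, 4), (2, 2, 6, 6), (10, 10, 14, 14)]
gt = (2, 2, 5, 5)
best_anchor = max(range(len(anchors)), key=lambda i: iou(gt, anchors[i]))
```

In practice, detectors also treat anchors whose IoU exceeds a fixed threshold as positives, but the core assignment criterion is this overlap measure.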
Since then, a universal rule has emerged that anchor boxes and NMS are always necessary for object detectors. Despite their effectiveness, these two components also raise some issues. Because anchor boxes carry prior knowledge
about the data, when applied to a new dataset, the anchor boxes need to be re-
calculated. Additionally, anchor boxes make the model implementation relatively
complex. With NMS, although this algorithm is relatively simple, the need to cal-
culate the IOU of numerous box pairs makes this step quite slow. It increases the
inference time of the model significantly.
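The greedy NMS procedure discussed above can be sketched as follows; the boxes, scores, and the 0.5 threshold are illustrative values, not defaults of any specific framework.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    every remaining box whose IoU with it exceeds iou_thresh. The
    pairwise IoU computations are what make this step slow when
    there are many candidate boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if box_iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep

# Two overlapping detections of the same object collapse into one.
kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
           [0.9, 0.8, 0.7])
```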
To experiment with the elimination of anchor boxes, FCOS [6] considers each pixel
on the output feature map as an anchor. The model predicts a box for each pixel
and optimizes the distances from the four edges of the labeled box to the positions
of those pixels. This is done for all pixels within the labeled box. With outstand-
ing results surpassing the anchor-based models of that time, FCOS concluded that
anchor boxes are not truly necessary.
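The per-pixel regression target used by FCOS can be illustrated with a small sketch; the pixel coordinates and box below are made-up values, and real implementations work on downsampled feature-map coordinates.

```python
def fcos_regression_target(px, py, box):
    """FCOS-style target for a pixel (px, py) lying inside a
    ground-truth box (x1, y1, x2, y2): the distances from the
    pixel to the left, top, right, and bottom edges."""
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)

def is_positive(px, py, box):
    """Every pixel inside the box is treated as a positive sample."""
    x1, y1, x2, y2 = box
    return x1 < px < x2 and y1 < py < y2

box = (2, 3, 10, 9)
target = fcos_regression_target(6, 5, box)  # distances (l, t, r, b)
```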
Following that, several studies aimed to eliminate NMS, transforming the model into
an end-to-end architecture, meaning it does not require any non-differentiable post-
processing steps. One such model is DETR [12], which utilizes the Transformer, a
well-known architecture in the field of Natural Language Processing (NLP). This
model directly predicts N boxes and assigns each predicted box a label using the
bipartite matching algorithm with the Hungarian Algorithm. Another form of end-to-end model is the heatmap-based model. These models consider an object as a
set of key points, such as CornerNet [13], which uses two corner key points (top left
and bottom right), and CenterNet [7], which uses three key points, including the
center.
For the polyp detection problem, current methods [14] treat each polyp as an object and apply general-purpose object detection models.
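The training target of a heatmap-based detector such as CenterNet can be sketched as a 2-D Gaussian peaked at each object center. The map size and the sigma value below are illustrative choices; CenterNet derives sigma from the box size.

```python
import numpy as np

def center_heatmap(height, width, cx, cy, sigma=1.5):
    """Heatmap target: a 2-D Gaussian peaked at the object center
    (cx, cy); the center pixel itself gets the maximum value 1.0."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

heatmap = center_heatmap(8, 8, cx=3, cy=4)
```

At inference time, local maxima of the predicted heatmap directly give object centers, which is why no anchors or NMS are needed.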
2.2.2 Semi-supervised Learning
Semi-supervised learning is a training method that utilizes both labeled and unla-
beled data simultaneously. It was initially applied to image classification problems.
There are two main techniques commonly employed in semi-supervised learning:
Consistency-based and Pseudo labeling.
Consistency-based: During training, we handle labeled and unlabeled data
points differently: for points with labels, we optimize using traditional super-
vised learning, calculating loss by comparing our prediction to our label; for
unlabeled points, we want to enforce that similar data points have similar pre-
dictions. With augmentations, we can create artificially similar data points.
For a given image x, we have an augmented image x̂, and our model should make similar predictions for x and x̂. The total difference between the two predictions can be used as the unsupervised loss [15, 16]. The overall objective function is then the sum of the supervised and unsupervised losses.
Pseudo labeling: Involves assigning pseudo labels to unlabeled data based
on the predictions made by the model. The model is first trained on the
labeled data, and then it uses this trained model to make predictions on the
unlabeled data. The predictions are converted into “one-hot” vectors, where the most confident class becomes the pseudo-label, and the model is further trained on the combined labeled and pseudo-labeled data [17, 18].
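The two techniques above can be sketched for the classification case as follows. This is a minimal illustration: the 0.9 confidence threshold and the squared-difference consistency term are assumed choices, not the exact losses of the cited methods.

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    """Convert the model's class probabilities on an unlabeled sample
    into a one-hot pseudo-label; return None when the prediction is
    not confident enough (the threshold is an illustrative choice)."""
    c = int(np.argmax(probs))
    if probs[c] < threshold:
        return None
    one_hot = np.zeros_like(probs)
    one_hot[c] = 1.0
    return one_hot

def consistency_loss(pred, pred_aug):
    """Unsupervised consistency term: mean squared difference between
    the predictions on an image x and on its augmented view."""
    return float(np.mean((np.asarray(pred) - np.asarray(pred_aug)) ** 2))

label = pseudo_label(np.array([0.05, 0.92, 0.03]))  # one-hot for class 1
loss = consistency_loss([0.1, 0.9], [0.2, 0.8])     # small when views agree
```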
Both consistency-based methods and pseudo labeling techniques have been success-
fully applied in various domains to leverage the benefits of unlabeled data. These
semi-supervised learning approaches have shown promising results by effectively
utilizing both labeled and unlabeled data to enhance the performance of machine
learning models.
2.2.3 Semi-supervised Object Detection (SSOD)
Currently, the existing methods in Semi-Supervised Object Detection (SSOD) [19,
20, 21] are based on the knowledge distillation technique, where two models, namely
the student and the teacher, are used with different initializations. By combining
this technique with the two main directions in semi-supervised learning, there are
also two popular directions in SSOD: Pseudo-boxes and Consistency-based. How-
ever, the object detection problem, with its output being a set of boxes containing
the positions and classes of objects in each box, is different from the classification
problem where the output is a single representation of the image’s class. There-
fore, we cannot directly apply the methods of semi-supervised learning in general to
SSOD but need to make suitable adaptations.
In the pseudo-boxes approach, STAC [22] trains a teacher model using labeled data
and then uses it to generate pseudo-boxes for unlabeled data. After discarding boxes
with low confidence scores using a threshold, the remaining boxes are used as pseudo-
labels for the unlabeled data. The student model is then trained on both labeled and
unlabeled data. However, selecting an appropriate threshold is a challenging issue
in this method. A high threshold will remove many high-quality boxes, while a low
threshold will allow low-quality boxes to be included. To address this, SoftTeacher
[23] uses a two-stage Faster R-CNN model that separates the process of selecting
boxes with good localization and high confidence scores, and combines them to
obtain a suitable set of candidate boxes. Other methods like [21, 19] use YOLOv5,
a one-stage anchor-based model, as a baseline for box selection techniques to improve
the quality of pseudo-boxes.
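The STAC-style threshold filtering and its trade-off can be sketched as follows; the boxes, scores, and thresholds are made-up values for illustration only.

```python
def filter_pseudo_boxes(boxes, scores, threshold):
    """STAC-style filtering: keep only teacher-predicted boxes whose
    confidence score reaches the threshold; the rest are discarded."""
    return [box for box, score in zip(boxes, scores) if score >= threshold]

boxes = [(0, 0, 5, 5), (3, 3, 9, 9), (1, 1, 2, 2)]
scores = [0.95, 0.40, 0.70]
strict = filter_pseudo_boxes(boxes, scores, threshold=0.9)  # may drop good boxes
loose = filter_pseudo_boxes(boxes, scores, threshold=0.5)   # may keep bad boxes
```

The two calls make the dilemma concrete: the strict threshold keeps only one box, while the loose one admits a second box whose quality is unknown.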
In the consistency-based approach, DenseTeacher [20] uses FCOS, a one-stage anchor-
free model, to directly transform dense feature map output into pseudo-labels. This
method retains the top k pixels with the highest scores on the output map and
suppresses the remaining pixels to a value of 0.
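The top-k selection used by DenseTeacher can be sketched as below; the 2x2 score map and k=2 are toy values, and the real method operates on full FCOS output maps.

```python
import numpy as np

def topk_dense_pseudo_labels(score_map, k):
    """DenseTeacher-style pseudo-labels: keep the k highest values on
    the teacher's dense output map and suppress the rest to 0."""
    flat = score_map.ravel()
    keep_idx = np.argsort(flat)[::-1][:k]
    out = np.zeros_like(flat)
    out[keep_idx] = flat[keep_idx]
    return out.reshape(score_map.shape)

teacher_scores = np.array([[0.9, 0.1],
                           [0.4, 0.8]])
dense_labels = topk_dense_pseudo_labels(teacher_scores, k=2)
```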
Both of these approaches have achieved state-of-the-art scores on various benchmarks such as COCO [24] and Pascal VOC [25]. However, none of the above methods
works for heatmap-based end-to-end detectors.
Chapter 3
Method
This chapter provides a detailed explanation of the theoretical aspects of the research. Section 3.1 presents the modeling of the object detection problem in the form of a heatmap-based approach, specifically the original CenterNet model. Section 3.2 discusses the improvements made to the original CenterNet model to achieve the CenterNet++ model with better results. Building upon these advancements, Section 3.3 presents the Dense Target Producer method, which applies Semi-Supervised Object Detection (SSOD) to the CenterNet++ model.
3.1 Preliminary
3.1.1 Residual Block
In traditional network architectures, layers are connected sequentially to each other.
In networks that utilize the Residual Block, a layer can be connected directly to the layer right after it or to a layer several positions away.
The Residual Block helps the network become deeper while maintaining high accu-
racy. According to the universal approximation theorem, any function can be rep-
resented by a neural network with an appropriate number of layers. As the number
of layers increases, the network can represent more complex functions. Therefore,
increasing the number of layers or the depth of the network is a way to enhance the
accuracy of the model. However, it has been observed that when the depth reaches
a certain level, the accuracy does not increase and may even decrease. This issue
arises from the vanishing gradient problem. This issue occurs when the gradient
signals propagated through the backpropagation process become increasingly weak
(approaching zero) as they pass through each layer, leading to a situation where
the layers closer to the input do not receive significant updates in their parameters,
especially in very deep networks.
Figure 3.1: Architecture of Residual Block
As depicted in Figure 3.1, the blocks of the network learn a mapping H(x) for the input x of the block. In this case, the residual can be defined as:
R(x) = Output - Input = H(x) - x, or equivalently H(x) = R(x) + x
For the Residual Block, learning H(x) is essentially learning R(x), because x is already carried forward to the output. Learning the residual is considered easier than directly learning the output: in the worst case, we can make H(x) equivalent to x simply by setting R(x) to zero. Furthermore, H(x) contains x, which means that the blocks in the deepest layers preserve the information from the earlier layers. Consequently, the backpropagation process does not encounter the vanishing gradient problem when propagating gradients through the network.
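The behavior of a residual block can be sketched with plain NumPy; the two weight matrices stand in for the block's convolutional layers, and the all-zero weights below are an assumed toy configuration used to demonstrate the identity case.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Sketch of a residual block: the branch computes the residual
    R(x) (two weight layers here stand in for the convolutions), and
    the skip connection adds the input back, so the block outputs
    R(x) + x."""
    residual = relu(x @ w1) @ w2  # residual branch R(x)
    return relu(residual + x)     # skip connection adds the input back

# If the residual branch outputs zeros, the block reduces to the
# identity mapping (for non-negative inputs), so extra depth cannot
# degrade the representation.
x = np.array([1.0, 2.0, 3.0])
w_zero = np.zeros((3, 3))
out = residual_block(x, w_zero, w_zero)
```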
3.1.2 Backbone used in CenterNet baseline
CenterNet baseline uses ResNet, DLA and Hourglass networks for the backbone.
ResNet: Researchers believe that increasing the depth of a network can lead
to better learning performance. This intuition is reasonable because as the
network depth increases, the number of parameters and the learning capacity
also increase. However, in practice, when experimenting with the VGG net-
work, it was observed that this belief is not entirely true due to the vanishing
gradient problem.
In 2015, ResNet emerged as a breakthrough in network architecture design
and has made significant advancements in image classification tasks. The key
innovation of ResNet is its use of residual connections, also known as skip con-
nections or shortcut connections. These connections allow information to flow
directly from one layer to another without passing through a series of interme-
diate layers. By incorporating residual connections, the network became much
deeper compared to previous architectures, resulting in a significant improve-
ment in accuracy. Moreover, with a simple architecture composed of stacked
residual blocks, it became easy to add or remove blocks to create appropriate
depths. A residual block consists of multiple stacked convolutional layers with
shortcut connections.
ResNet is one of the most popular networks in computer vision tasks due to its simple (easy to implement) yet highly effective architecture. Although ResNet can be customized with different numbers of layers, the most popular versions are ResNet18, ResNet50, and ResNet101. Figure 3.2 provides a detailed illustration of the ResNet18 model’s architecture.
Figure 3.2: Architecture of ResNet18. Source: https://www.pluralsight.com/guides/introduction-to-resnet
DLA: DLA is a deep neural network architecture that aims to address the challenges
of designing efficient and accurate models for object detection and semantic
segmentation tasks. For similar reasons to ResNet, and with the desire to
reduce the number of parameters and memory usage while maintaining good
accuracy, DLA was introduced in 2018. The main idea behind DLA is to lever-
age hierarchical feature representations by aggregating features from different
layers of the network. This aggregation process allows the model to capture
both low-level fine-grained details and high-level semantic information, leading
to improved performance.
At that time, most skip connections in networks were relatively simple, referred to as "shallow". DLA proposed architectural forms to make these skip connections deeper. DLA is based on two main architectural forms: Iterative Deep
Aggregation (IDA) and Hierarchical Deep Aggregation (HDA). These forms
were developed to enhance the depth and effectiveness of skip connections in
the network. In IDA, aggregation begins at the shallowest, smallest scale and
then iteratively merges deeper, larger scales. In this way, shallow features
are refined as they are propagated through different stages of aggregation. In
HDA, blocks and stages in a tree are merged to preserve and combine feature
channels. With HDA, shallower and deeper layers are combined to learn richer
combinations that span more of the feature hierarchy. While IDA effectively
combines stages, it is insufficient for fusing the many blocks of a network, as
it is still only sequential.
DLA models have demonstrated efficiency in terms of both computational cost
and memory requirements, making them suitable for real-time and resource-
constrained applications. Figure 3.3 shows the architecture of DLA.
Figure 3.3: Architecture of DLA (source: https://sh-tsang.medium.com/review-dla-deep-layer-aggregation-581b543c8a9d)
Hourglass: proposed in 2016, Hourglass is a deep neural network model designed for accurate and efficient human pose estimation. The goal of human
pose estimation is to predict the locations of body joints or key points in
an image. The Hourglass architecture addresses this task by using a series
of stacked hourglass modules, each of which consists of an encoder-decoder
structure.
The hourglass module is named after its shape, which resembles an hourglass.
It is composed of several key components: residual blocks, pooling layers,
skip connections, and intermediate supervision. Residual blocks, based on
the ResNet architecture, form the backbone of each hourglass module. These
blocks help capture and propagate important features throughout the net-
work, allowing for better representation learning. Pooling layers are used to
downsample the feature maps, reducing their spatial dimensions while increas-
ing their depth. This downsampling helps in capturing features at different
scales and resolutions. Skip connections are introduced to preserve fine-grained
spatial information. They connect the encoder and decoder parts of the hour-
glass module, allowing the network to access both low-level and high-level
features simultaneously. Skip connections enable the model to refine pose esti-
mations by leveraging information at multiple scales. Intermediate supervision
involves adding supervision at multiple stages within the hourglass module.
By introducing intermediate supervision, the network receives feedback and
gradient signals at different depths, aiding in better training and improving
performance. The original Hourglass architecture consists of several stacked
hourglass modules, with each module refining the pose estimation from the
previous one. The final prediction is obtained from the output of the last
hourglass module.
The Hourglass architecture has proven to be highly effective for human pose
estimation, achieving state-of-the-art results on various benchmark datasets.
Its ability to capture multi-scale information, preserve spatial details, and
leverage intermediate supervision makes it a powerful tool for this task. Figure 3.4 shows an example of an Hourglass network architecture used for a segmentation task.
Figure 3.4: An example of an Hourglass network architecture used for a segmentation task (source: https://towardsdatascience.com/hourglass-networks-to-understand-human-poses-1e40e349fa15)
3.1.3 Feature Pyramid Network (FPN)
In the object detection task, detecting small objects has always been a significant
challenge. To address this issue, a solution has been proposed, which involves using
images at different scales (resolutions). By doing so, when images have larger sizes,
smaller objects are also enlarged and become easier to detect. However, processing
multiple images can be time-consuming and memory-intensive, making it impractical
for training and suitable only for inference. To mitigate this drawback, instead
of using multiple images through the same convolutional network, one approach
is to pass an image through the convolutional network and extract feature maps
at different scales. Another issue that arises is that the feature maps obtained
from the initial layers may not sufficiently represent the image features, resulting in
suboptimal detection performance.
The Feature Pyramid Network (FPN) is designed to address both of these challenges
comprehensively. FPN consists of a series of operations that combine multi-scale fea-
ture maps to produce high-quality feature representations at different levels. Figure
3.5 illustrates the design of the FPN module, which demonstrates how the multi-
scale feature maps are integrated.
Figure 3.5: FPN architecture (source: https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c)
FPN incorporates both bottom-up and top-down pathways. The bottom-up path-
way involves performing regular convolutions, where the feature map size decreases
as we move upward, but the extracted features become more abundant. The top-
down pathway focuses on reconstructing larger feature maps from smaller ones while
preserving rich information.
In the bottom-up pathway, the input image undergoes convolutional operations,
resulting in a series of feature maps with decreasing spatial dimensions. These
feature maps capture hierarchical representations of the input, with higher-level
feature maps containing more abstract and semantic information.
The top-down pathway complements the bottom-up process by utilizing lateral con-
nections. It takes smaller feature maps from the bottom-up pathway and upsamples
them to match the dimensions of the corresponding feature maps in the higher levels.
This reconstruction process combines the fine-grained details from the bottom-up
pathway with the semantic information from the top-down pathway. The result is a
set of feature maps at different scales, where each map contains rich and meaningful
information.
By integrating both pathways, FPN creates a feature pyramid that encompasses
multi-scale representations of the input image. This enables the network to capture
objects of various sizes and accurately detect them. The design of FPN balances the
trade-off between spatial resolution and semantic information, leading to enhanced
object detection performance.
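The top-down pathway with lateral connections can be sketched as follows. This is a simplified NumPy illustration in which the 1x1 lateral convolutions become per-pixel channel projections and upsampling is nearest-neighbor; the function and variable names are ours, not from any FPN implementation:

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def lateral(f, w):
    """1x1 'convolution': map C_in -> C_out channels at every pixel."""
    c, h, wd = f.shape
    return (w @ f.reshape(c, -1)).reshape(w.shape[0], h, wd)

def fpn_top_down(features, lateral_ws):
    """Build the FPN pyramid from bottom-up features.

    features is ordered from the largest (shallow) to the smallest (deep)
    map. The deepest map starts the top-down pathway; each step upsamples
    the running map and adds the lateral projection of the corresponding
    bottom-up map, merging semantics with spatial detail.
    """
    p = lateral(features[-1], lateral_ws[-1])
    pyramid = [p]
    for f, w in zip(reversed(features[:-1]), reversed(lateral_ws[:-1])):
        p = upsample2x(p) + lateral(f, w)
        pyramid.append(p)
    return pyramid[::-1]  # back to largest-first order

rng = np.random.default_rng(0)
c_out = 8
feats = [rng.standard_normal((4, 16, 16)),
         rng.standard_normal((8, 8, 8)),
         rng.standard_normal((16, 4, 4))]
ws = [rng.standard_normal((c_out, f.shape[0])) * 0.1 for f in feats]
pyramid = fpn_top_down(feats, ws)
# Every pyramid level now has c_out channels at its original resolution.
assert [p.shape for p in pyramid] == [(8, 16, 16), (8, 8, 8), (8, 4, 4)]
```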
3.1.4 CenterNet modelization
Let $I \in \mathbb{R}^{W \times H \times 3}$ be an input image of width $W$ and height $H$, and let $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ be the output heatmap, where $R$ is the output stride and $C$ is the number of classes. We use the default output stride $R = 4$; the output stride downsamples the output prediction by a factor of $R$. A prediction $\hat{Y}_{x,y,c} = 1$ indicates that the pixel $(x, y)$ is the center of an object of class $c$, and $\hat{Y}_{x,y,c} = 0$ indicates background. Each ground truth object center $p \in \mathbb{R}^2$ has a low-resolution equivalent $\tilde{p} = \lfloor p / R \rfloor$. All ground truth keypoints are then splatted onto a heatmap $Y \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ using a Gaussian kernel
$$Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right),$$
where $\sigma_p$ is an object size-adaptive standard deviation. Each object has a $\sigma_p$ equal to the radius of the inscribed circle of its bounding box. If two Gaussians of the same class overlap, the element-wise maximum is taken.
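The splatting of ground truth centers onto the target heatmap can be illustrated with a short NumPy sketch (the function name is ours):

```python
import numpy as np

def splat_center(Y, center, sigma, cls):
    """Splat one ground-truth center onto heatmap Y of shape (H, W, C)
    with a Gaussian kernel; overlapping Gaussians of the same class keep
    the element-wise maximum."""
    h, w, _ = Y.shape
    xs = np.arange(w)[None, :]
    ys = np.arange(h)[:, None]
    cx, cy = center
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    Y[:, :, cls] = np.maximum(Y[:, :, cls], g)

Y = np.zeros((32, 32, 2))
splat_center(Y, center=(10, 12), sigma=2.0, cls=0)
splat_center(Y, center=(14, 12), sigma=2.0, cls=0)  # overlap: max is kept
assert Y[12, 10, 0] == 1.0 and Y[12, 14, 0] == 1.0  # both centers peak at 1
assert Y[:, :, 1].max() == 0.0                       # other class untouched
```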
The training objective for the heatmap is a modified focal loss:
$$L_k = -\frac{1}{N} \sum_{x,y,c} \begin{cases} (1 - \hat{Y}_{x,y,c})^{\alpha} \log(\hat{Y}_{x,y,c}) & \text{if } Y_{x,y,c} = 1 \\ (1 - Y_{x,y,c})^{\beta} (\hat{Y}_{x,y,c})^{\alpha} \log(1 - \hat{Y}_{x,y,c}) & \text{otherwise} \end{cases} \quad (3.1)$$
where $\alpha = 2$ and $\beta = 4$ are hyperparameters of the focal loss and $N$ is the number of
objects in image I. To recover the discretization error caused by the output stride,
the model additionally predicts a local offset $\hat{O} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ for each center point. All classes $c$ share the same offset prediction. The offset is trained with an L1 loss:
$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right) \right| \quad (3.2)$$
The supervision acts only at keypoint locations $\tilde{p}$; all other locations are ignored.
Let $s_i = (w_i, h_i)$ be the width and height of the bounding box of object $i$, whose center point lies at $p_i$. The single object size map $\hat{S} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is predicted for all object categories and trained with an L1 loss:
$$L_{size} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{S}_{p_i} - s_i \right| \quad (3.3)$$
We do not normalize the scale and directly use the raw pixel coordinates. Instead, we scale the loss by a constant $\lambda_{size}$. The overall training loss is:
$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off} \quad (3.4)$$
In all experiments, we set $\lambda_{size} = 0.1$ and $\lambda_{off} = 1.0$. We use a single network to predict the keypoints $\hat{Y}$, the offset $\hat{O}$, and the size $\hat{S}$. The network predicts a total of $C + 4$ outputs at each location.
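The training losses above can be sketched numerically as follows; this is an illustrative NumPy version with our own function names (the offset loss of Eq. (3.2) is computed analogously to the size loss and is omitted for brevity):

```python
import numpy as np

def focal_loss(Y_hat, Y, alpha=2, beta=4, eps=1e-6):
    """Modified focal loss of Eq. (3.1) over heatmaps of shape (H, W, C)."""
    pos = Y == 1
    n = max(pos.sum(), 1)
    pos_term = ((1 - Y_hat) ** alpha) * np.log(Y_hat + eps)
    neg_term = ((1 - Y) ** beta) * (Y_hat ** alpha) * np.log(1 - Y_hat + eps)
    return -(pos_term[pos].sum() + neg_term[~pos].sum()) / n

def l1_at_centers(pred_map, targets, centers):
    """Sparse L1 loss in the style of Eqs. (3.2)/(3.3): supervision acts
    only at the (low-resolution) center pixels, all others are ignored."""
    n = max(len(centers), 1)
    return sum(np.abs(pred_map[y, x] - t).sum()
               for (x, y), t in zip(centers, targets)) / n

# Tiny worked example: a 4x4 single-class map with one object centered at (1, 2).
Y = np.zeros((4, 4, 1)); Y[2, 1, 0] = 1.0
Y_hat = np.full((4, 4, 1), 0.01); Y_hat[2, 1, 0] = 0.9
size_map = np.zeros((4, 4, 2)); size_map[2, 1] = (10.0, 8.0)
l_k = focal_loss(Y_hat, Y)
l_size = l1_at_centers(size_map, [np.array([12.0, 8.0])], [(1, 2)])
l_det = l_k + 0.1 * l_size           # lambda_size = 0.1, offset term omitted
assert np.isclose(l_size, 2.0)       # |10 - 12| + |8 - 8|
```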
At inference time, we first extract the peaks in the heatmap for each category independently. We detect all responses whose value is greater than or equal to their 8-connected neighbors and keep the top 100 peaks. Each keypoint location is given by integer coordinates $(x_i, y_i)$. We use the keypoint values $\hat{Y}_{x_i y_i c}$ as a measure of detection confidence and produce a bounding box at the location:
$$\left(x_i + \delta x_i - w_i/2,\; y_i + \delta y_i - h_i/2,\; x_i + \delta x_i + w_i/2,\; y_i + \delta y_i + h_i/2\right)$$
where $(\delta x_i, \delta y_i) = \hat{O}_{x_i y_i}$ is the offset prediction and $(w_i, h_i) = \hat{S}_{x_i y_i}$ is the size prediction. All outputs are produced directly without the need for NMS or other post-processing. The peak keypoint extraction serves as a sufficient NMS alternative and can be implemented efficiently using a $3 \times 3$ max pooling operation.
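This peak-extraction step can be sketched in NumPy as follows (a real implementation would use a framework max-pooling operator; the function names here are ours):

```python
import numpy as np

def maxpool3x3(hm):
    """3x3 max pooling with stride 1 on an (H, W) heatmap (-inf padding)."""
    padded = np.pad(hm, 1, constant_values=-np.inf)
    h, w = hm.shape
    windows = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.max(windows, axis=0)

def extract_peaks(hm, top_k=100):
    """Keep only pixels equal to their 3x3 neighborhood maximum (the NMS
    alternative used by heatmap-based detectors), then return the top_k
    peak coordinates as (x, y) pairs sorted by score."""
    peaks = np.where(hm == maxpool3x3(hm), hm, 0.0)
    ys, xs = np.unravel_index(np.argsort(peaks, axis=None)[::-1], hm.shape)
    coords = [(x, y) for x, y in zip(xs, ys) if peaks[y, x] > 0]
    return coords[:top_k]

hm = np.zeros((8, 8))
hm[2, 3] = 0.9   # a strong peak
hm[2, 4] = 0.5   # adjacent response: suppressed, not its neighborhood maximum
hm[6, 6] = 0.7   # a second, separate peak
assert extract_peaks(hm, top_k=2) == [(3, 2), (6, 6)]
```

The adjacent response at (4, 2) is dropped without any IoU threshold, which is exactly why no explicit NMS step is needed.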
3.2 Improving Baseline model
Current one-stage object detection models are typically composed of two main components: the Backbone and the Head. The Backbone is a deep neural network
responsible for feature extraction from the input image. The features of the image
are represented by feature maps with a relatively large dimension. These feature
maps are the outputs of different layers in different stages of the backbone, so they
have different sizes and store different information about the objects. Common back-
bones are often models used for image classification tasks, as they are trained on the
large ImageNet dataset, enabling them to extract good features. The feature maps
output from the backbone are then passed to the Head module, which is another
different deep neural network. This is the distinguishing part among different types
of models. Therefore, both the backbone and head significantly influence the per-
formance of the model. Improving these two components contributes to enhancing
the model’s results.
3.2.1 Backbone
The original CenterNet model utilizes ResNet, DLA, and Hourglass as backbones
[26, 27, 28] along with FPN to generate feature maps at various resolutions. However,
since the introduction of these backbones, a considerable amount of time has passed,
and more advanced backbones have emerged, surpassing them in terms of both
accuracy and inference speed.
Figure 3.6: Overview architecture of CenterNet++.
Through a number of experiments detailed in the next chapter, we have decided to use an improved version of the YOLOv5 architecture [5], the one used in YOLOv8 [29], as the backbone. There are four main types of blocks used in YOLOv8: Conv,
SPPF, BottleNeck, and C2f. The Backbone is a series of convolutional layers that
extract relevant features from the input image. The SPPF layer and the subsequent
convolution layers process features at a variety of scales, while the Upsample layers
increase the resolution of the feature maps. The C2f module combines high-level
features with contextual information to improve detection accuracy. The architec-
ture of the backbone is designed to be fast and efficient, while still achieving high
detection accuracy.
Figure 3.7: Four module types used in the backbone of CenterNet++
Based on Figure 3.7, we can see details about these blocks:
Conv: is a fundamental component that combines three layers: Convolutional,
BatchNormalization, and an activation function. In this case, the activation
function employed is the Sigmoid Linear Unit (SiLU), which is defined by the formula $x \cdot \text{sigmoid}(x)$. SiLU has been claimed to provide better performance than ReLU, which is used in ResNet.
SPPF: is a pooling layer that removes the fixed-size constraint of the net-
work, i.e. a CNN does not require a fixed-size input image. SPPF utilizes
three MaxPooling layers with different kernel sizes to highlight information
within feature maps, considering various receptive fields. These MaxPooling
layers divide the feature maps into sub-regions and extract the most dominant
features within each region. The information extracted by the MaxPooling lay-
ers is then fused with the original feature maps. This fusion process combines
the highlighted features with the existing information, striking a balance be-
tween emphasizing important features and retaining crucial information from
the original feature maps.
BottleNeck: is inspired by the concept of bottleneck residual blocks intro-
duced in the ResNet architecture. This block is a combination of two Conv
blocks and a skip connection. The first 3x3 convolution reduces the number
of input channels while the subsequent 3x3 convolution operates on the re-
duced channel dimension, enabling the network to extract more complex and
abstract features. The BottleNeck block makes the model deeper while keeping
computational costs low. This block aims to strike a balance between model
complexity and computational efficiency while maintaining high-quality fea-
ture extraction.
C2f: Among the four block types, Conv, SPPF, and BottleNeck have structures similar to those in YOLOv5, while the C2f block is different. This block is a combination of Conv and BottleNeck blocks. In the C2f block, the input feature map is passed through a Conv block and then split into two feature maps with the same number of channels, each fed to a branch. The second branch passes through 3 or 6 BottleNeck blocks. The outputs of all BottleNeck blocks are then aggregated with the first branch to obtain a single output feature map. This feature map then passes through a Conv block to obtain the final output of the C2f block. Attention modules are also included within the C2f block
to help the network attend to important features and suppress less relevant
information.
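The SiLU activation used inside the Conv blocks can be illustrated directly (a minimal NumPy sketch):

```python
import numpy as np

def silu(x):
    """SiLU activation used in the Conv blocks: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + np.exp(-x)))

# Unlike ReLU, SiLU is smooth and non-monotonic: small negative inputs
# produce small negative outputs instead of being zeroed out entirely.
x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
y = silu(x)
assert y[2] == 0.0                               # silu(0) = 0
assert -0.5 < y[1] < 0.0                         # negatives not hard-clipped
assert np.isclose(y[3], 1.0 / (1.0 + np.exp(-1.0)))
```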
The FPN module, commonly used in various Object Detection models, has been
replaced with similar connections, such as those found in DLA. This replacement
enhances the complexity of skip connections, resulting in improved information ag-
gregation. As a result, the FPN module has been eliminated from the model.
With the design of C2f blocks, the model is much deeper than the old backbones with an FPN module while keeping the number of parameters low. With the architecture otherwise unchanged, the model can be scaled up or down along two factors:
depth and width. The model width can be adjusted by increasing or decreasing the number of channels in the Conv blocks, and the model depth can be changed via the number of BottleNeck blocks inside the C2f blocks. Within a single C2f block, if we
change the number of BottleNeck blocks from 3 to 6, the total depth of the model
increases by 18.
Instead of using the FPN module, the modified backbone incorporates C2f blocks,
Conv blocks, and Upsample layers to generate three feature maps at different scales.
These feature maps have strides of 4, 8, and 16 compared to the input image,
respectively. These three feature maps serve as inputs to the Head section of the
model. By providing multiple scales of feature maps, the model becomes more
robust in detecting objects of different sizes, while also maintaining a high level of
contextual information.
3.2.2 Head
In the YOLO series, predictions are made at three outputs across different scales,
which helps the model achieve better learning performance for diverse object sizes.
However, for heatmap-based models like CenterNet++, a single-scale prediction is
enough. This is reasonable since polyp sizes are less diverse. To combine these
three outputs into a single input feature map, the Adaptive Scale Fusion (ASF)
module [30] is employed.
Figure 3.8: Architecture of ASF module
Different from most of the other methods that fuse the features of different scales by
simply cascading or summing up, ASF is designed to dynamically fuse the features
of different scales. As shown in Figure 3.8, the features of different scales are scaled
into the same resolution before being fed into the ASF. Assume that the input consists of $N$ feature maps $X \in \mathbb{R}^{N \times C \times H \times W} = \{X_i\}_{i=0}^{N-1}$, where $N$ is set to 3.
Firstly, we concatenate the scaled input features $X$, and then a $3 \times 3$ convolutional layer follows to obtain an intermediate feature $S \in \mathbb{R}^{N \times C \times H \times W}$.
Secondly, the attention weights $A \in \mathbb{R}^{N \times H \times W}$ are calculated by applying a spatial attention module to the feature $S$.
Thirdly, the attention weights $A$ are split into $N$ parts along the channel dimension and multiplied with the corresponding scaled features to obtain the fused feature $F \in \mathbb{R}^{N \times H \times W}$.
In this way, the scale attention is defined as:
$$S = \text{Conv}(\text{concat}([X_0, X_1, \ldots, X_{N-1}]))$$
$$A = \text{SpatialAttention}(S)$$
$$F = \text{concat}([A_0 X_0, A_1 X_1, \ldots, A_{N-1} X_{N-1}])$$
where concat indicates the concatenation operator, Conv represents the $3 \times 3$ convolutional operator, and SpatialAttention indicates a spatial attention module, as illustrated in Figure 3.8. The spatial attention mechanism in the ASF makes the attention weights more flexible across the spatial dimension.
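The fusion idea can be sketched as follows. For compactness, this NumPy illustration sums the attention-weighted maps instead of concatenating them and uses a plain softmax in place of the full spatial attention module; all names are ours:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asf_fuse(features, attention_logits):
    """Conceptual ASF fusion: weight each of the N scaled feature maps
    (each C x H x W) by a per-pixel attention weight before combining.

    attention_logits has shape (N, H, W); taking a softmax over N makes
    the per-scale weights comparable at every spatial location, so the
    contribution of each scale varies across the image.
    """
    A = softmax(attention_logits, axis=0)           # (N, H, W)
    X = np.stack(features)                          # (N, C, H, W)
    return (A[:, None, :, :] * X).sum(axis=0)       # (C, H, W)

rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 16, 16)) for _ in range(3)]
logits = rng.standard_normal((3, 16, 16))
fused = asf_fuse(feats, logits)
assert fused.shape == (8, 16, 16)
```

This is the key difference from plain summation or cascading: the weights $A$ depend on the content at each pixel rather than being fixed.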
The single output feature map of the ASF is then passed through a head of 3 Conv blocks before being separated into three lightweight branches corresponding to the three outputs of the original CenterNet [8]. Fig. 3.6 illustrates the overall architecture of CenterNet++.
3.3 Dense Target Producer (DTP)
With the CenterNet++ model as the baseline, the Dense Target Producer (DTP) is
proposed to apply SSOD to this model. This section presents the general pipeline
of SSOD methods, highlights the limitations of existing methods, and provides a
detailed exposition of the DTP approach. Additionally, it explains why DTP is
more suitable for heatmap-based models compared to previous methods.
Figure 3.9: The overview of our proposed pipeline for unlabeled data compared with the existing pseudo-box based pipeline. For each iteration, dense targets are generated by the teacher model on unlabeled images via the Dense Target Producer (DTP) and are then used by the student model to calculate the unsupervised loss. The total loss is the sum of the supervised loss and the unsupervised loss. Note that DTP does not need any postprocessing steps.
3.3.1 Pseudo-Labeling Framework
Recent methods have commonly employed a shared pipeline with the knowledge
distillation technique. Figure 3.9 illustrates an iteration of the training process.
Each step in an iteration is performed as follows:
A data mini-batch containing both labeled and unlabeled images is randomly
sampled.
The unlabeled data in the batch, after some augmentations, is fed into the teacher model to generate pseudo-labels. The teacher model and student
model are identical models but initialized differently. During training, the
weights of the teacher model are updated using exponential moving averages
(EMA) based on the student model.
The labeled data in the batch is fed into the student model for vanilla training, and the supervised loss, denoted $L_s$, is calculated. The unlabeled data undergoes a different set of augmentations and is fed into the student model to generate predictions. The unsupervised loss, denoted $L_u$, is computed based on these predictions and the pseudo-labels generated by the teacher model in the previous step.
The total loss is computed as a weighted sum of the supervised loss and unsu-
pervised loss, and it is used to optimize the student model. Subsequently, the
student model updates the weights for the teacher model in an EMA manner.
The update equation for the teacher model weights is:
$$w_{teacher} = decay \cdot w_{teacher} + (1 - decay) \cdot w_{student}$$
Here, $decay$ is a hyperparameter between 0 and 1 that controls the rate of decay or smoothing. A higher value of $decay$ gives more weight to the historical teacher weights, while a lower value gives more weight to the most recent student updates.
The overall loss function is calculated by:
$$L = L_s + \lambda_u L_u \quad (3.5)$$
In most existing methods, the unsupervised loss $L_u$ is calculated with pseudo-boxes. However, using processed boxes as pseudo-labels can be inefficient and sub-optimal for SSOD.
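The EMA update of the teacher weights in the pipeline above can be sketched as follows, operating on plain dictionaries of parameter arrays as a stand-in for real model weights:

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """In-place EMA: teacher <- decay * teacher + (1 - decay) * student.

    With decay close to 1, the teacher changes slowly and acts as a
    temporal ensemble of past student states, which stabilizes the
    pseudo-labels it produces.
    """
    for name, w in teacher.items():
        teacher[name] = decay * w + (1.0 - decay) * student[name]

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
ema_update(teacher, student, decay=0.9)
assert np.allclose(teacher["w"], 0.1)   # 0.9 * 0 + 0.1 * 1
```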
3.3.2 Disadvantages of Pseudo-box Labels
According to Zhou et al. [20], pseudo-box methods face several issues that negatively impact the performance of the model. These issues are as follows:
Threshold selection: The teacher model plays a role in generating pseudo-
labels for unlabeled data. An important issue is selecting an appropriate
threshold to discard low-quality boxes while retaining high-quality ones. How-
ever, finding an optimal threshold is relatively challenging. Setting the thresh-
old too high may result in the removal of many high-quality boxes, leaving
the student model with insufficient learning material. Conversely, setting the
threshold too low may allow many low-quality boxes to be included, signifi-
cantly affecting the quality of the pseudo-labels.
IoU threshold selection for NMS: Non-Maximum Suppression (NMS) is
commonly used in object detection models. Similar to the threshold for box
filtering, NMS also requires a threshold. During NMS, if two boxes of the
same class have an IoU greater than the threshold, the box with a lower score
is eliminated. Selecting this threshold is another challenging task.
Inconsistent labeling: Current pseudo-boxes methods typically convert the
filtered high-quality boxes into pseudo-labels for final supervision. However,
these pseudo-boxes are often not accurate in terms of their localization. If they
are used conventionally, the model would learn from these incorrect boxes,
which would negatively impact the model’s quality.
Compared to anchor-based and anchor-free + NMS models, heatmap-based detec-
tors do not require NMS, thus avoiding the issue of selecting its IoU threshold.
However, when applying current pseudo-box methods to heatmap-based models, several issues arise that make this approach ineffective. The following issues can be highlighted:
Figure 3.10: How object detection methods obtain positive samples: (a) raw image; (b) anchor-based and anchor-free + NMS methods assign all pixels inside the ground truth box; (c) heatmap-based methods focus only on the center pixel of the ground truth box.
Filtering high-quality boxes: As shown in Figure 3.10, anchor-based or anchor-
free + NMS models consider all pixels within an object as positive points
during training. Therefore, during inference, these models generate multiple
boxes to represent the same object. These boxes have decreasing scores, and
larger objects tend to have more representative boxes. This leads to the pos-
sibility that even after filtering candidate boxes using thresholds and NMS,
there may still exist one or more boxes representing the object. In contrast,
heatmap-based models represent objects with a single box. Thus, during the
box filtering process, if this box is discarded, the object is considered as back-
ground during training. Clearly, this significantly affects the performance of
the model.
Effect of inaccurate pseudo-boxes: Heatmap-based models often use a
Gaussian kernel to represent objects on feature maps. The objective of these
models is to learn this heatmap representation. On the heatmap, the closer the
pixel is to the center, the higher the pixel’s value and its importance. In this
case, if the obtained pseudo-boxes have inaccurate positions, they will create an
erroneous kernel, meaning both the object center and the surrounding positions
are incorrect. This is more critical compared to anchor-based or anchor-free +
NMS models, as the pixels in the region between pseudo-boxes and true boxes
are still accurately learned.
Figure 3.11: Comparisons between (a) the raw image, (b) foreground pixels assigned by ground truth boxes, and (c) foreground pixels assigned by pseudo-boxes.
Effect of inaccurate pseudo-boxes on CenterNet++: The loss function of CenterNet++ highlights the position of the object center while pushing the probabilities of other positions toward 0. Therefore, it is very easy to obtain boxes with an inaccurate center. Additionally, CenterNet++ directly learns
the length and width information of the box, so it is relatively heavily in-
fluenced by boxes with inaccurate positions. As shown in Figure 3.11, even
though the pseudo-boxes have an IoU with the ground truth boxes higher than 0.5, the center of the object is inaccurate. This leads the model to learn to predict the wrong center and to push the probability of the ground truth center toward zero.
All of the above issues make the application of pseudo-boxes-based SSOD meth-
ods to heatmap-based models ineffective. Therefore, this research aims to apply
consistency-based SSOD methods instead.
3.3.3 Proposed method
Let $(\hat{Y}_s, \hat{Y}_t)$, $(\hat{S}_s, \hat{S}_t)$, and $(\hat{O}_s, \hat{O}_t)$ be the output pairs from the student and teacher networks, respectively.
In the supervised step, Focal Loss is responsible for maximizing the values of pixels
at the centers of objects while pushing the values of surrounding pixels closer to zero.
Figure 3.12 provides two examples illustrating the heatmap values at the beginning
and end stages of the training process.
(a) Heatmap in the early stage of training process
(b) Heatmap in the later stage of training process
Figure 3.12: Example of describing heatmap values at the beginning and end of the
training process. Left: Groundtruth (GT) and Right: Prediction (Pred).
DTP provides a target that helps the focal loss in Eq. (3.1) learn more effectively.
Since this loss function pays a lot of attention to the center of the object, we use $\hat{Y}_t$ to highlight pixels that are likely to be the center. To achieve this, we propose the Threshold Epoch Adaptor (TEA) to provide a dynamic threshold. Pixels greater than this threshold are set to 1; the remaining pixels are kept unchanged.
There are two reasons for doing that:
The Gaussian kernel draws concentric circles on the heatmap target $Y$, where pixels on the same circle have the same probability, which decreases as the radius increases. We want to mimic this characteristic of pseudo-labels in the early stages of the SSOD process. We hope that by highlighting the pseudo-center and keeping all other pixels unchanged, we obtain a pseudo-heatmap-like target.
The new heatmap generated by DTP will then form a mask to guide the
learning of object size and offset. Therefore, information about the size and
offset of the object will be weighted and synthesized from the information of
the surrounding pixels.
TEA: The threshold is dynamic:
$$pivot = \lambda_1 + \lambda_2 \cdot \frac{t}{T} \quad (3.6)$$
$$threshold = \min(\theta, \max(pivot, \theta_k)) \quad (3.7)$$
where:
$\theta$, $\theta_k$: the highest and $k$-th highest values of $\hat{Y}_t$.
$\lambda_1$, $\lambda_2$: low and high thresholds to control label quality. We default to 0.05 and 0.2 in all our experiments.
$t$ and $T$: the current epoch and the total number of epochs.
The confidence scores of predicted boxes will increase during the training process.
Recognizing this characteristic, TEA proposes a dynamic threshold by adapting it
based on the current epoch. Additionally, to increase the likelihood that the highlighted pixels are truly object centers, only the pixels with the top $k$ highest values are selected.
The dynamic threshold defined in Eq. (3.7) is used to calculate a mask $M$:
$$heat_{xyc} = \begin{cases} 1, & \text{if } p_{xyc} > threshold \\ p_{xyc}, & \text{otherwise} \end{cases} \quad (3.8)$$
$$M_{xy} = \max_{c \in [1, C]}(heat_{xyc}) \quad (3.9)$$
where $p_{xyc}$ denotes the score prediction of the $c$-th class at position $(x, y)$ on $\hat{Y}_t$. Let $N_t$ be the number of pixels in $heat$ equal to 1 (which are the object centers). We can prove that $1 \le N_t \le k$:
If $pivot > \theta$, then $threshold = \theta$ and $N_t = 1$; in this case, we keep only the center with the highest score.
If $\theta_k \le pivot \le \theta$, then $threshold = pivot$ and $1 \le N_t \le k$; in this case, we keep all centers in the top $k$ whose scores are higher than $pivot$.
If $pivot < \theta_k$, then $threshold = \theta_k$ and $N_t = k$; in this case, we keep all centers whose scores are in the top $k$.
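The TEA threshold of Eqs. (3.6)-(3.7) and the mask of Eqs. (3.8)-(3.9) can be sketched as follows (an illustrative NumPy version with our own function names):

```python
import numpy as np

def tea_threshold(Y_t, t, T, k=5, lam1=0.05, lam2=0.2):
    """Dynamic threshold of Eqs. (3.6)-(3.7) from the teacher heatmap Y_t."""
    pivot = lam1 + lam2 * t / T
    flat = np.sort(Y_t, axis=None)[::-1]
    theta, theta_k = flat[0], flat[k - 1]     # highest and k-th highest scores
    return min(theta, max(pivot, theta_k))

def dtp_mask(Y_t, threshold):
    """Eqs. (3.8)-(3.9): highlight likely centers, keep all other pixels
    unchanged, and reduce over classes to get the spatial weight mask M."""
    heat = np.where(Y_t > threshold, 1.0, Y_t)
    return heat, heat.max(axis=-1)            # heat: (H, W, C), M: (H, W)

rng = np.random.default_rng(0)
Y_t = rng.uniform(0.0, 0.3, size=(8, 8, 1))   # mostly low-confidence pixels
Y_t[3, 4, 0] = 0.95                           # one confident pseudo-center
thr = tea_threshold(Y_t, t=10, T=100, k=5)
heat, M = dtp_mask(Y_t, thr)
assert heat[3, 4, 0] == 1.0                   # the confident center is highlighted
assert (heat <= 1.0).all() and M.shape == (8, 8)
```

Early in training the pivot is low and the top-$k$ cap dominates; as $t$ grows the pivot rises, mirroring the increasing confidence of the teacher's predictions.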
Once $M$ is calculated, the objective function of SSOD is:
$$L_u = L_{uc} + \lambda_{size} L_{us} + \lambda_{off} L_{uo} \quad (3.10)$$
where the unsupervised heatmap loss $L_{uc}$, unsupervised object size loss $L_{us}$, and unsupervised offset loss $L_{uo}$ are calculated by:
$$L_{uc} = \text{MSE}(\hat{Y}_s, heat)$$
$$L_{us} = \frac{\sum_{xyc} |\hat{S}_s \odot M - \hat{S}_t \odot M|_{xyc}}{\sum M}$$
$$L_{uo} = \frac{\sum_{xyc} |\hat{O}_s \odot M - \hat{O}_t \odot M|_{xyc}}{\sum M}$$
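These unsupervised losses can be sketched numerically (an illustrative NumPy version with our own names; the offset loss is analogous to the size loss and is omitted):

```python
import numpy as np

def unsup_losses(Y_s, heat, S_s, S_t, M):
    """Sketch of the unsupervised losses: MSE on the heatmap plus a
    mask-weighted L1 between student and teacher size maps, normalized
    by the total mask weight."""
    l_uc = np.mean((Y_s - heat) ** 2)
    M3 = M[:, :, None]                           # broadcast mask over channels
    l_us = np.abs(S_s * M3 - S_t * M3).sum() / max(M.sum(), 1e-6)
    return l_uc, l_us

rng = np.random.default_rng(0)
Y_s = rng.uniform(size=(8, 8, 1))
heat = rng.uniform(size=(8, 8, 1))
S_t = rng.uniform(size=(8, 8, 2))
M = np.zeros((8, 8)); M[3, 4] = 1.0              # one active pseudo-center
l_uc, l_us = unsup_losses(Y_s, heat, S_t + 0.5, S_t, M)
# Only the masked pixel contributes: |0.5| per size channel / sum(M) = 1.0.
assert np.isclose(l_us, 1.0)
```

In practice $M$ is soft rather than binary, so pixels near the pseudo-center contribute in proportion to their heatmap value, which is what prevents the size and offset targets from over-committing to an inaccurate center.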
This design has some advantages:
The student can both mimic the teacher's heatmap and obtain extra information about the highlighted center points of the objects. Based on that, the focal loss of the supervised part in Eq. (3.1) can learn better.
To optimize box sizes and offsets, only the center points are considered, but in the unsupervised part these points are not fully accurate. Information about the sizes and offsets of an object is therefore still learned at the pixels adjacent to the actual center. We thus use the mask $M$ as a weight matrix, which lets neighboring pixels contribute to learning and avoids bias toward the pseudo-center.
In summary, to mitigate the limitations discussed in the previous section, we have
employed the following strategies:
Avoid lengthy postprocessing steps: To avoid the need for complex and
parameter-intensive postprocessing procedures, we have chosen not to utilize
pseudo-boxes and instead employ the dense output directly generated by the
model as pseudo-labels for the semi-supervised learning process. In this ap-
proach, we still focus on the dense heatmap output and proceed to highlight
pixels that exhibit a high likelihood of representing the object’s center.
Avoid difficulty choosing thresholds: In the pixel highlighting process,
the selection of pixels that are highly probable to be the centers of objects
necessitates the use of a threshold for filtering purposes. Given the chal-
lenges encountered when manually selecting thresholds, we propose a dynamic
thresholding approach. The threshold value at each iteration is determined
based on the current epoch value during the training process and the model’s
performance at that specific iteration. By adopting this strategy, we avoid the manual threshold selection required by the pseudo-box postprocessing step.
Avoid inaccurate pseudo-centers: As previously discussed, the utilization
of pseudo-boxes results in the model learning a pseudo-center, while simultane-
ously shifting the value of the ground truth center toward zero. Even when using the dense output directly, there remains a high possibility that an arbitrary pixel's value surpasses that of the ground truth center.
To mitigate this issue, we propose using the highlighted heatmap output as
a weight matrix, preventing the model from solely optimizing the object size
value at the pseudo-center location. Additionally, this approach establishes a
relationship between the heatmap output and the remaining outputs, specifi-
cally the object size and center offset.
The subsequent chapter will present a series of experiments that show the effective-
ness of the proposed Dynamic Thresholding and DTP methods.
Chapter 4
Experiments
This chapter provides a detailed presentation of the used data, experimental setup,
and results obtained. Based on these outcomes, this research can draw conclusions
regarding the effectiveness of DTP in applying SSOD to the CenterNet++ model.
Section 4.1 introduces the specific dataset utilized in the experiments. Section 4.2
presents a comprehensive summary of all hyperparameter values, augmentation tech-
niques employed, and experimental configurations. Section 4.3 showcases the results
obtained from the conducted experiments. Finally, Section 4.4 presents experiments
designed to evaluate the effectiveness of certain components within this study.
4.1 Dataset
4.1.1 Dataset
We conduct experiments on the PolypsSet dataset [14]. Wang et al. collected
all publicly available endoscopic datasets in the research community, as well as
a new dataset from the University of Kansas Medical Center. Each constituent
dataset is introduced below.
MICCAI 2017: This dataset was designed for Gastrointestinal Image ANAlysis
(GIANA), a sub-challenge of the Endoscopic Vision Challenge [31]. It contains
18 videos for training and 20 videos for testing. The dataset is labeled only
with polyp masks to test the ability to identify and localize polyps within
images; there are no classification labels. Bounding boxes are converted from
the polyp masks, and a polyp class is annotated for each frame.
CVC ColonDB: The dataset has 15 short colonoscopy videos with a total of
300 frames [32]. The labels are in the form of segmentation masks, and there
are no classification labels. The same conversion process as for MICCAI 2017 is applied.
Figure 4.1: Some examples of images in the PolypsSet dataset
GLRC Dataset: The Gastrointestinal Lesions in Regular Colonoscopy (GLRC)
dataset contains 76 short video sequences with class labels [33]. There are no
polyp location labels, so the bounding box of each polyp was manually annotated
frame by frame.
KUMC Dataset: The dataset was collected from the University of Kansas
Medical Center and contains 80 colonoscopy video sequences. Wang et al.
manually labeled the bounding boxes as well as the polyp classes for the entire
dataset.
This dataset has two classes: adenomatous and hyperplastic. The training,
validation, and testing sets contain 28,773, 4,254, and 4,872 images, respectively.
Figure 4.1 shows some data samples from the PolypsSet dataset.
The following two experimental settings are studied:
Fully supervised learning: Training on the whole training set and reporting
performance on the testing set. This fully supervised setup aims to evaluate
the effectiveness of our proposed CenterNet++.
Semi-supervised learning: 1%, 5%, and 10% of the training set are sampled
as labeled data, respectively. The rest of the images are treated as unlabeled
data while training.
Note that we keep the validation set and test set untouched for a fair comparison
with other methods.
Besides measuring performance in the two-class scenario, we conduct additional
experiments that treat both classes as a common polyp class; we call this the
single-class scenario. We use AP50 and AP50:95 as the metrics.
4.1.2 Metrics
Average Precision (AP) is a popular evaluation metric in object detection,
used to measure the accuracy of a model's predictions. AP is computed from
the Intersection Over Union (IOU), Precision (P), and Recall (R). Noting that
the output of an object detection model is a set of bounding boxes, we have:
IOU: the ratio between the intersection and the union of the areas of two boxes.
Figure 4.2 provides three examples illustrating the IOU values between two boxes.
An IOU threshold is set, and if the IOU between a predicted box and a ground-truth
box exceeds the threshold, the prediction is considered correct (Positive object);
otherwise, it is considered a false prediction (Negative object).
Figure 4.2: Examples of IOU between two boxes
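The IOU of two axis-aligned boxes can be computed directly; a short sketch, assuming boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); returns intersection over union in [0, 1].
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

For example, two unit-offset 2x2 boxes overlap in a 1x1 region, giving an IOU of 1/7.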
Based on that, each sample in the testing set has three values:
True Positive (TP): the number of Positive prediction boxes.
False Positive (FP): the number of Negative prediction boxes.
False Negative (FN): the number of ground-truth boxes that have no corre-
sponding prediction box.
Precision and Recall: are calculated from the TP, FP, and FN values above.
Precision, P = TP / (TP + FP), represents the proportion of positive objects
among all predicted objects, while Recall, R = TP / (TP + FN), measures the
proportion of positive objects among all ground-truth objects.
Precision represents the confidence level of predictions (the percentage of correctly
predicted bounding boxes), while Recall reflects the ability to find all ground-truth
boxes. These two metrics are typically inversely related. A higher Precision indicates
that the model produces fewer bounding boxes but with higher accuracy; this increases
FN, as more labels have no corresponding prediction, resulting in lower Recall.
Conversely, a lower Precision implies more predictions, so more labels are matched
by a prediction, leading to higher Recall.
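In code, the two quantities follow directly from the TP/FP/FN counts:

```python
def precision_recall(tp: int, fp: int, fn: int):
    # Precision: fraction of predictions that are correct.
    # Recall: fraction of ground-truth objects that were found.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```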
Figure 4.3: Example of Precision-Recall curve. The mean Average Precision (mAP)
is the area under the curve.
Precision-Recall Curve: For each class, a precision-recall curve is constructed
by varying the confidence threshold for accepting a detection. Different threshold
values result in different precision and recall values, forming a curve. Figure 4.3
shows an example of a Precision-Recall curve.
Average Precision: The AP is calculated by computing the area under the
precision-recall curve for each class. In the multiclass case, mAP is the mean
AP over classes. AP and mAP summarize the overall performance of the
model.
In this study, we use AP50 and AP50:95 as metrics, where AP50 denotes AP with
an IOU threshold of 50% and AP50:95 is the mean of the APs with IOU thresholds
from 50% to 95%.
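A minimal sketch of the area-under-curve computation for one class, using all-point interpolation (one common convention; the exact interpolation scheme used by a given benchmark is an assumption here):

```python
def average_precision(recalls, precisions):
    # recalls: sorted ascending; precisions: matching values per point.
    r = [0.0] + list(recalls)
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left,
    # then sum the rectangles under the stepwise precision-recall curve.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))
```

For instance, points (recall 0.5, precision 1.0) and (recall 1.0, precision 0.5) give an AP of 0.75.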
4.2 Implementation details
We summarize all hyper-parameters used in this study as below:
Supervised loss: We follow the original CenterNet paper exactly. In the focal
loss, α and β are set to 2.0 and 4.0, respectively. For the weights of the size
loss and offset loss, we set λ_size = 0.1 and λ_off = 1.0. Note that these
weights are used in both the supervised and semi-supervised steps.
Unsupervised loss: the values of the lower-bound threshold λ_1 and higher-bound
threshold λ_2 were set to 0.05 and 0.2, respectively. As the number of polyps
appearing in a frame is small, we set k = 5. The unsupervised loss weight λ_u
is set to 0.1 to prevent the unlabeled data from affecting the model results too much.
Training: All experiments in both settings used the AdamW optimizer. The
learning rate is initialized at 10^-3 and decreased to 10^-5 using the Cosine
Annealing schedule. For the fully supervised learning setting, we trained
the model for 100 epochs, while for the semi-supervised learning setting,
we trained the model for 50 epochs. Note that for each epoch of fully super-
vised learning, we iterate through all the data, whereas for semi-supervised
learning, we iterate through all the unlabeled data. We use a single RTX 3090
GPU with 24 GB of memory for both training and evaluation.
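The Cosine Annealing decay from 10^-3 to 10^-5 follows the standard formula, sketched below (a per-step transcription; the training code presumably uses the framework's built-in scheduler):

```python
import math

def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    # Standard cosine annealing: starts at lr_max, ends at lr_min.
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term
```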
The image augmentation techniques also have a significant impact on the perfor-
mance of the model; selecting appropriate techniques and combining them effectively
can greatly improve the results. As shown in Figure 3.9, unlabeled images fed into
the Teacher model receive only weak augmentations, while strong augmentations are
applied before the images are fed into the Student model. Labeled images receive
both weak and strong augmentations, as well as Mosaic.
In this study, we define weak augmentations as operations that do not change
a pixel's value but change its position: Random Scale, Random Rotate,
Horizontal Flip, and Vertical Flip. In contrast, strong augmentations are operations
that only change a pixel's value: ColorJitter, Dropout, GaussianBlur,
and ToGray.
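Weak (geometric) augmentations must also remap the bounding boxes; a horizontal flip, for instance, can be sketched as follows (the (x1, y1, x2, y2) pixel-coordinate box format is an assumption of this sketch):

```python
import numpy as np

def horizontal_flip(image, boxes):
    # Pixel values are untouched; only positions change, so the box
    # x-coordinates must be mirrored around the image width.
    w = image.shape[1]
    flipped = image[:, ::-1].copy()
    new_boxes = [(w - x2, y1, w - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes
```

Strong (photometric) augmentations, by contrast, leave the box coordinates unchanged.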
Another augmentation we use is Mosaic, a data augmentation technique first
introduced in YOLOv4. It combines four training images into one in certain
ratios, which allows the model to learn to identify objects at a smaller scale
than normal. It is also useful for significantly reducing the need for a large
mini-batch size during training. Figure 4.4 shows some examples of
mosaic-augmented images. Mosaic is only used in the supervised part of
SSOD training.
Figure 4.4: Examples of Mosaic augmentations
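A simplified Mosaic sketch that tiles four images into fixed quadrants (real implementations pick a random split point and also remap the bounding boxes; those details are omitted here):

```python
import numpy as np

def mosaic(images, out_size=640):
    # Tile four images into the four quadrants of one canvas, using a
    # naive nearest-neighbour resize so each image fits its quadrant.
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, corners):
        h, w = img.shape[:2]
        ys = np.minimum(np.arange(half) * h // half, h - 1)
        xs = np.minimum(np.arange(half) * w // half, w - 1)
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas
```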
4.3 Results
In this section, we show the results of our experiments with the two above settings.
Fully supervised learning. In this setup, we compared the CenterNet++ model
with other object detection models in both single-class and two-class scenarios,
against three types of detectors:
CenterNet original, with the purpose of verifying the improvement of Center-
Net++.
Methods used in the paper of Wang et al. [14], to check whether CenterNet++,
which incorporates several recent techniques, can surpass these older detectors
after a few years.
YOLOv8 [29], the latest state-of-the-art detector for the object detection task.
4.3.1 CenterNet++ improvement results
In this part, we show the comparisons between the CenterNet++ and the other
detectors. We conduct experiments for both single-class and two-class scenarios.
Table 4.1: Comparison between CenterNet++ and other object detectors for fully
supervised learning on the single-class PolypsSet dataset.

Method                Type         AP50(%)  FPS  Params
FasterRCNN [4]        two-stage    85.6     16   30.2M
SSD [11]              one-stage    86.3     59   16.8M
RetinaNet [34]        one-stage    87.9     16   32.4M
ATSS [35]             one-stage    88.1     19   41.5M
RefineDet [36]        one-stage    88.5     32   19.8M
CenterNet (base) [8]  anchor-free  84.3     -    24M
Yolov8n [29]          anchor-free  85.2     200  3M
CenterNet++ (Ours)    anchor-free  89.2     222  3.3M
For the single-class scenario, according to Table 4.1, CenterNet++ achieves
the highest AP50 overall. It improves substantially on the baseline, increasing
AP50 by 4.9 points. Our model improves by 4.0 AP50 over YOLOv8n with a
comparable number of parameters, and by 0.7 AP50 over the best RefineDet model
of Wang et al. [14]. Moreover, with a very small number of parameters,
CenterNet++ has a very high inference speed, exceeding the 200 FPS of the
YOLOv8 model. In this scenario, our model therefore outperforms the others in
both accuracy and inference speed.
For the two-class scenario, according to Table 4.2, CenterNet++, although not
achieving the highest AP50, is still the second best. Our model improves by
3.6 AP50 over the baseline and 0.9 AP50 over the Yolov8n model, and achieves
competitive accuracy with much larger models. We can conclude that our
CenterNet++ model offers the best trade-off between accuracy and speed.
Therefore, based on the aforementioned results, we can observe the improvements
achieved by the CenterNet++ model compared to the baseline CenterNet model.
Moreover, in comparison to other models, CenterNet++ demonstrates competitive-
ness and even outperforms them in terms of the trade-off between accuracy and
inference time.
Table 4.2: Comparison between CenterNet++ and other object detectors for fully
supervised learning on the two-class PolypsSet dataset.

Method                Type         AP50(%)            Params
                                   ad     hp    mean
FasterRCNN [4]        two-stage    72.9   42.5  57.7   30.3M
SSD [11]              one-stage    82.7   52.5  67.6   16.8M
RetinaNet [34]        one-stage    57.9   40.5  49.2   32.4M
ATSS [35]             one-stage    80.7   58.4  69.5   41.5M
RefineDet [36]        one-stage    81.1   65.9  73.5   19.8M
CenterNet (base) [8]  anchor-free  79.7   52.9  66.3   24M
Yolov8n [29]          anchor-free  82.8   55.1  69.0   3M
CenterNet++ (Ours)    anchor-free  81.8   57.9  69.9   3.3M
4.3.2 DTP results
In this setup, we evaluate the effectiveness of the Dense Target Producer. Similarly,
we conduct experiments for both single-class and two-class scenarios.
Table 4.3: Experimental results for the semi-supervised learning setting on the
PolypsSet dataset.

Class                 Partial  Supervised  DTP (Ours)
ad                    1%       17.6        41.8
                      5%       54.9        63.0
                      10%      66.6        67.1
hp                    1%       7.1         22.6
                      5%       30.4        40.6
                      10%      37.4        45.1
average of ad and hp  1%       12.4        31.0±2.8 (+18.6)
                      5%       42.7        52.3±4.2 (+9.6)
                      10%      52.0        55.2±3.0 (+3.2)
single-class          1%       62.4        79.1±0.9 (+16.7)
                      5%       81.3        84.5±0.9 (+3.2)
                      10%      83.3        85.8±1.2 (+2.8)
According to Table 4.3, for the single-class scenario, DTP improves the accuracy
of the supervised model by 16.7, 3.2, and 2.8 AP50 when using 1%, 5%, and
10% labeled data, respectively. The corresponding improvements for the two-class
scenario are 18.6, 9.6, and 3.2 AP50.
We compared DTP with other state-of-the-art methods currently available. Based
on Table 4.4, DTP not only helps to increase the performance of the supervised
baseline model but also provides a better level of improvement compared to other
methods. This is due to the excellent accuracy provided by CenterNet++. Fig. 4.5
Table 4.4: Comparisons on the single-class PolypsSet dataset in the semi-supervised
learning setting, using 10% of the training data as labeled.

Methods                         AP50  AP50:95
CenterNet++ (Fully supervised)  83.0  47.4
Yolov5n (Fully supervised)      77.3  45.8
EfficientTeacher [21]           83.0  49.1
OneTeacher [37]                 84.6  50.0
CenterNet++ w/ DTP (Ours)       85.8  50.4
illustrates some examples where DTP yields better detection results than the
supervised baseline.
Based on the above results, we can observe a significant improvement in performance
between DTP and the supervised baseline, demonstrating the successful application
of Semi-Supervised Object Detection (SSOD) to the heatmap-based CenterNet++
model. Furthermore, when compared to other recent methods, DTP also exhibits
superior results, highlighting the effectiveness of the proposed method.
4.4 Ablation Studies
In this section, we carry out experiments to evaluate the effectiveness of the
proposed modules.
Effectiveness of Backbone. The backbone network architecture plays a crucial
role in the accuracy and speed of the model. We experimented with several archi-
tectures and found that the Yolov8n backbone provides the best trade-off between
accuracy and speed, as shown in Table 4.5.
Table 4.5: Experiments on different backbones of CenterNet++ on the two-class
PolypsSet dataset.

Backbone     AP50(%)             Params
             ad     hp    mean
Resnet50     79.7   52.9  66.3   19M
DenseNet101  73.6   45.1  59.35  22.5M
Yolox-m      76.2   54.2  65.2   22.3M
Yolov8-n     81.8   57.9  69.9   3.3M
Based on Table 4.5, the Yolov8-n backbone achieves the highest AP50 scores for
both classes and also has the highest mean AP50 of 69.9%. Resnet50 comes second
with a mean AP50 of 66.3%. Yolov8-n thus stands out as the best-performing
backbone in terms of accuracy, achieving the highest AP50 scores and mean AP50
while having the lowest number of parameters (3.3 million). Resnet50
(a) Miss detection for hyperplastic class
(b) Wrong class detection for hyperplastic class
(c) Wrong class detection for adenomatous class
(d) Miss detection for adenomatous class
Figure 4.5: Visualizations for PolypsSet 10% two-class. From left to right: Without
DTP (supervised), with DTP and label
also demonstrates strong performance, while DenseNet101 and Yolox-m have slightly
lower accuracy. Considering both accuracy and parameter efficiency, Yolov8-n is
therefore the most favorable choice.
Effectiveness of Mosaic and ASF. We evaluated the effectiveness of the Mosaic
augmentation and the ASF module. According to Table 4.6, adding Mosaic and ASF
leads to significant improvements (+6 AP50) in the performance of the model.
When neither Mosaic nor ASF is enabled, the model achieves an AP50 of 83.2% and
an AP50:95 of 48.7%; this serves as the baseline performance. Using only ASF, the
model's performance improves slightly. Using only Mosaic yields a noticeable
Table 4.6: Effectiveness of the ASF module and Mosaic augmentation on the single-
class PolypsSet dataset.

Mosaic  ASF  AP50  AP50:95
-       -    83.2  48.7
-       x    83.5  49.4
x       -    86.9  52.7
x       x    89.2  54.5
performance boost: the model achieves an AP50 of 86.9% and an AP50:95 of 52.7%,
improving on the baseline by 3.7 AP50 and 4.0 AP50:95. When both Mosaic and ASF
are enabled, the model exhibits the best performance, achieving an AP50 of 89.2%
and an AP50:95 of 54.5%, the highest values.
Effectiveness of TEA. Table 4.7 shows that using TEA provides better results
than using fixed thresholds throughout the training process. TEA improves AP50
by 2.0 points over the best fixed threshold, 0.8. Moreover, even in the worst
case with a fixed threshold of 0.5, DTP still outperforms the supervised baseline
by 2.1 AP50 and 4.8 AP50:95. This proves that DTP is more effective than the
supervised baseline.
Table 4.7: Effectiveness of using TEA compared to fixed thresholds. Experiments
were conducted on the single-class PolypsSet dataset with 1% of the data used as
labeled.

Threshold   AP50  AP50:95
supervised  62.4  28.4
0.1         76.6  38.7
0.2         70.3  37.1
0.3         65.2  33.8
0.4         67.4  34.5
0.5         64.5  33.2
0.6         73.5  37.9
0.7         64.5  33.8
0.8         77.1  40.6
0.9         70.3  36.3
TEA         79.1  42.0
Chapter 5
Conclusion
This study proposes a novel SSOD method for heatmap-based end-to-end object
detectors. We propose the Dense Target Producer (DTP) to generate pseudo-labels
for unlabeled data in an end-to-end manner. In addition, we propose the Threshold
Epoch Adaptor (TEA) to control the quality of pseudo-labels dynamically. We
also propose a lightweight yet powerful heatmap-based object detector that yields
competitive results and even surpasses other large object detectors. Experiments on
a large polyp dataset demonstrate the effectiveness of our method. In future work,
we would like to investigate more powerful lightweight backbones, as well as other
mechanisms to improve the quality of pseudo-labels.
Bibliography
[1] R. L. Siegel, K. D. Miller, N. S. Wagle, and A. Jemal, “Cancer statistics, 2023,”
CA: A Cancer Journal for Clinicians, vol. 73, no. 1, pp. 17–48, 2023.
[2] N. H. Kim, Y. S. Jung, W. S. Jeong, H.-J. Yang, S.-K. Park, K. Choi, and
D. I. Park, “Miss rate of colorectal neoplastic polyps and risk factors for missed
polyps in consecutive colonoscopies,” Intestinal research, vol. 15, no. 3, pp. 411–
418, 2017.
[3] M. B. Wallace, P. Sharma, P. Bhandari, J. East, G. Antonelli, R. Lorenzetti,
M. Vieth, I. Speranza, M. Spadaccini, M. Desai, et al., “Impact of artificial
intelligence on miss rate of colorectal neoplasia,” Gastroenterology, vol. 163,
no. 1, pp. 295–304, 2022.
[4] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object
detection with region proposal networks,” 2016.
[5] G. Jocher, “YOLOv5 by Ultralytics,” May 2020.
[6] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage
object detection,” in Proceedings of the IEEE/CVF international conference on
computer vision, pp. 9627–9636, 2019.
[7] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint
triplets for object detection,” in Proceedings of the IEEE/CVF international
conference on computer vision, pp. 6569–6578, 2019.
[8] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019.
[9] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of
simple features,” in Proceedings of the 2001 IEEE computer society conference
on computer vision and pattern recognition. CVPR 2001, vol. 1, pp. I–I, IEEE,
2001.
[10] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings
of the IEEE conference on computer vision and pattern recognition, pp. 7263–
7271, 2017.
[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C.
Berg, “Ssd: Single shot multibox detector,” Lecture Notes in Computer Science,
p. 21–37, 2016.
[12] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko,
“End-to-end object detection with transformers,” in European conference on
computer vision, pp. 213–229, Springer, 2020.
[13] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in
Proceedings of the European conference on computer vision (ECCV), pp. 734–
750, 2018.
[14] K. Li, M. I. Fathan, K. Patel, T. Zhang, C. Zhong, A. Bansal, A. Rastogi, J. S.
Wang, and G. Wang, “Colonoscopy polyp detection and classification: Dataset
creation and comparative evaluations,” Plos one, vol. 16, no. 8, p. e0255809,
2021.
[15] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,”
arXiv preprint arXiv:1610.02242, 2016.
[16] V. Verma, K. Kawaguchi, A. Lamb, J. Kannala, A. Solin, Y. Bengio, and
D. Lopez-Paz, “Interpolation consistency training for semi-supervised learn-
ing,” Neural Networks, vol. 145, pp. 90–106, 2022.
[17] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness, “Pseudo-
labeling and confirmation bias in deep semi-supervised learning,” in 2020 Inter-
national Joint Conference on Neural Networks (IJCNN), pp. 1–8, IEEE, 2020.
[18] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised learn-
ing method for deep neural networks,” in Workshop on challenges in represen-
tation learning, ICML, vol. 3, p. 896, Atlanta, 2013.
[19] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira,
and P. Vajda, “Unbiased teacher for semi-supervised object detection,” arXiv
preprint arXiv:2102.09480, 2021.
[20] H. Zhou, Z. Ge, S. Liu, W. Mao, Z. Li, H. Yu, and J. Sun, “Dense teacher:
Dense pseudo-labels for semi-supervised object detection,” in Computer Vision–
ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
Proceedings, Part IX, pp. 35–50, Springer, 2022.
[21] B. Xu, M. Chen, W. Guan, and L. Hu, “Efficient teacher: Semi-supervised
object detection for yolov5,” arXiv preprint arXiv:2302.07577, 2023.
[22] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, and T. Pfister, “A sim-
ple semi-supervised learning framework for object detection,” arXiv preprint
arXiv:2005.04757, 2020.
[23] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, “End-
to-end semi-supervised object detection with soft teacher,” in Proceedings of
the IEEE/CVF International Conference on Computer Vision, pp. 3060–3069,
2021.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár,
and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer
Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, Septem-
ber 6-12, 2014, Proceedings, Part V 13, pp. 740–755, Springer, 2014.
[25] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and
A. Zisserman, “The pascal visual object classes challenge: A retrospective,”
International journal of computer vision, vol. 111, pp. 98–136, 2015.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog-
nition,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 770–778, 2016.
[27] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” in
Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 2403–2412, 2018.
[28] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human
pose estimation,” in Computer Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14,
pp. 483–499, Springer, 2016.
[29] G. Jocher, A. Chaurasia, and J. Qiu, “YOLO by Ultralytics,” Jan. 2023.
[30] M. Liao, Z. Zou, Z. Wan, C. Yao, and X. Bai, “Real-time scene text detection
with differentiable binarization and adaptive scale fusion,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 919–931, 2022.
[31] J. Bernal, N. Tajkbaksh, F. J. Sanchez, B. J. Matuszewski, H. Chen, L. Yu,
Q. Angermann, O. Romain, B. Rustad, I. Balasingham, et al., “Comparative
validation of polyp detection methods in video colonoscopy: results from the
miccai 2015 endoscopic vision challenge,” IEEE transactions on medical imag-
ing, vol. 36, no. 6, pp. 1231–1249, 2017.
[32] J. Bernal, J. Sánchez, and F. Vilariño, “Towards automatic polyp detection
with a polyp appearance model,” Pattern Recognition, vol. 45, no. 9, pp. 3166–
3182, 2012.
[33] P. Mesejo, D. Pizarro, A. Abergel, O. Rouquette, S. Beorchia, L. Poincloux, and
A. Bartoli, “Computer-aided classification of gastrointestinal lesions in regular
colonoscopy,” IEEE transactions on medical imaging, vol. 35, no. 9, pp. 2051–
2063, 2016.
[34] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense
object detection,” in Proceedings of the IEEE international conference on com-
puter vision, pp. 2980–2988, 2017.
[35] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between
anchor-based and anchor-free detection via adaptive training sample selection,”
in Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pp. 9759–9768, 2020.
[36] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement neu-
ral network for object detection,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 4203–4212, 2018.
[37] G. Luo, Y. Zhou, L. Jin, X. Sun, and R. Ji, “Towards end-to-end
semi-supervised learning for one-stage object detection,” arXiv preprint
arXiv:2302.11299, 2023.